Detecting Errors within a Corpus using Anomaly Detection

Author

  • Eleazar Eskin
Abstract

We present a method for automatically detecting errors in a manually marked corpus using anomaly detection. Anomaly detection is a method for determining which elements of a large data set do not conform to the whole. This method fits a probability distribution over the data and applies a statistical test to detect anomalous elements. In the corpus error detection problem, anomalous elements are typically marking errors. We present the results of applying this method to the tagged portion of the Penn Treebank corpus.

1 Introduction

Manually marking corpora is a time-consuming and expensive process. The process is subject to human error by the experts doing the marking. Unfortunately, many natural language processing methods are sensitive to these errors. To ensure accuracy in a corpus, several experts typically pass over the corpus to check consistency. For large corpora this can be a tremendous expense.

In this paper, we propose a method for automatically detecting errors in a marked corpus using an anomaly detection technique. This technique detects anomalies, that is, elements which do not fit in with the rest of the corpus. When applied to marked corpora, the anomalies tend to be errors in the markings of the corpus.

To detect the anomalies, we first compute a probability distribution over the entire corpus. Then we apply a statistical test which identifies which elements are anomalies. In this case the anomalies are the elements with very low likelihood. These elements are marked as errors and are thrown out of the corpus. The model is recomputed on the remaining elements. At conclusion, we are left with two data sets: one the normal elements and the second the detected
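The fit, test, remove, and refit loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's model: the text here does not specify the probability distribution or the statistical test, so the sketch assumes an add-alpha smoothed estimate of P(tag | word) and a fixed probability cutoff. The names `fit_model`, `detect_anomalies`, `threshold`, `alpha`, and `max_rounds` are hypothetical.

```python
from collections import Counter


def fit_model(elements, tagset_size, alpha=1.0):
    """Return a function estimating add-alpha smoothed P(tag | word).

    Assumption: a per-word tag frequency model stands in for the
    (unspecified) probability distribution over the corpus.
    """
    pair_counts = Counter(elements)
    word_counts = Counter(word for word, _ in elements)

    def prob(word, tag):
        numerator = pair_counts[(word, tag)] + alpha
        denominator = word_counts[word] + alpha * tagset_size
        return numerator / denominator

    return prob


def detect_anomalies(elements, threshold=0.2, max_rounds=10):
    """Split (word, tag) pairs into normal elements and detected anomalies.

    Assumption: a fixed probability cutoff stands in for the statistical test.
    """
    tagset_size = len({tag for _, tag in elements})
    normal = list(elements)
    anomalies = []
    for _ in range(max_rounds):
        prob = fit_model(normal, tagset_size)  # recompute on what remains
        flagged = [e for e in normal if prob(*e) < threshold]
        if not flagged:
            break  # nothing falls below the cutoff, so stop
        anomalies.extend(flagged)
        normal = [e for e in normal if prob(*e) >= threshold]
    return normal, anomalies


# Toy corpus: "the" is tagged DT twenty times and NN once.
corpus = (
    [("the", "DT")] * 20
    + [("the", "NN")]
    + [("dog", "NN")] * 20
    + [("barks", "VBZ")] * 20
)
normal, anomalies = detect_anomalies(corpus)
print(anomalies)  # [('the', 'NN')]
```

With a realistic tagset, this simple cutoff would also flag many correctly tagged rare words, so the smoothing and threshold would need tuning. The sketch is only meant to show the shape of the loop.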


Similar resources

Anomaly-based annotation errors detection in TTS corpora

In this paper we adopt several anomaly detection methods to detect annotation errors in single-speaker read-speech corpora used for text-to-speech (TTS) synthesis. Correctly annotated words are considered as normal examples on which the detection methods are trained. Misannotated words are then taken as anomalous examples which do not conform to normal patterns of the trained detection models. ...


Voting Detector: A Combination of Anomaly Detectors to Reveal Annotation Errors in TTS Corpora

Anomaly detection techniques were shown to help in detecting word-level annotation errors in read-speech corpora for text-to-speech synthesis. In this framework, correctly annotated words are considered as normal examples on which the detection methods are trained. Misannotated words are then taken as anomalous examples which do not conform to normal patterns of the trained detection models. In ...


Towards Detecting Annotation Errors in Spoken Language Corpora

The issue: Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), only recently has there been some work in detecting errors in synt...


High-Order Sequence Modeling for Language Learner Error Detection

We address the problem of detecting English language learner errors by using a discriminative high-order sequence model. Unlike most work in error-detection, this method is agnostic as to specific error types, thus potentially allowing for higher recall across different error types. The approach integrates features from many sources into the error-detection model, ranging from language model-ba...


Separation Between Anomalous Targets and Background Based on the Decomposition of Reduced Dimension Hyperspectral Image

Anomaly detection has been given a special place among the different processing tasks for hyperspectral images. Nowadays, many of the methods use only background information to distinguish between anomaly pixels and background. Due to noise and the presence of anomaly pixels in the background, the assumption of the specific statistical distribution of the background, as well as the co...




Publication date: 2000